This document describes the workflow used in the T. brucei gene co-expression network analysis. It contains the code used in each step of the analysis.
# set working directory
setwd("analysis/")
# ensure results are reproducible
set.seed(1)
# other settings
options(digits = 4)
options(stringsAsFactors = FALSE)
# loading required R packages
library("dupRadar")
library("Rsubread")
library("limma")
library("edgeR")
library("RColorBrewer")
library("ggplot2")
library("gplots")
library("reshape2")
library("ggfortify")
library("xlsx")
library("WGCNA")
library("flashClust")
library("igraph")
library("tidyverse")
# set number of threads to allow in WGCNA analysis
allowWGCNAThreads(nThreads=20)
The data used in this study are obtained from the European Nucleotide Archive under accession numbers SRP002243 and SRR965341.
First, metadata for the data are obtained from EBI as follows:
#Obtain metadata information for the data used in this study from ENA and SRA databases.
# ENA metadata
# code adapted from: https://wiki.bits.vib.be/index.php/Download_read_information_and_FASTQ_data_from_the_SRA
accession <- "SRP002243" # similarly, obtain data for SRR965341
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
accession,
"&result=read_run",
"&fields=run_accession,library_name,",
"read_count,fastq_ftp,fastq_aspera,",
"fastq_galaxy,sra_ftp,sra_aspera,sra_galaxy,",
"&download=text",
sep="")
ENA.metadata <- read.table(url(ena.url), header=TRUE, sep="\t")
# SRA metadata
SRA.metadata <- read.table("../data/SraRunTable.metadata.txt", header = TRUE, sep = "\t")
# create a text file with urls to fastq files in ENA database
fastq.urls <- ENA.metadata[grepl("fastq_ftp", names(ENA.metadata))]
write.table(fastq.urls, file="../data/fastq.urls.txt", eol = "\r\n", quote = FALSE,
            row.names = FALSE, col.names = FALSE) # no header line, which would otherwise be passed to wget
# obtain sample metadata to be used later in analysis in R.
matches <- c("Run","Library_Name","Sample_Name")
sample.metadata <- SRA.metadata[grepl(paste(matches, collapse="|"), names(SRA.metadata))]
# create grouping factor that will place each sample in one of three tissues, i.e.
# midgut (MG), proventriculus (PV) and salivary glands (SG)
tissue <- factor(c("MG", "MG", "MG", "MG", "MG", "PV", "PV", "SG", "SG", "SG", "SG",
"MG", "MG", "PV", "SG", "SG", "PV"))
# append factor to sample.metadata to group samples
sample.metadata["Tissue"] <- tissue
# The sample below was analysed separately as the reads are paired-end while
# the other samples are single-end.
#
# Add sample from Telleria et al 2014 study (SRR965341) to sample metadata.
sample.metadata$Run <- as.character(sample.metadata$Run)
sample.metadata <- rbind(sample.metadata, "18" = c("SA1", "SRR965341", "SA1", "SG"))
#####################################################################################
# Exclusion of the following samples was done after analysis showed they
# were below required standards
# remove a sample with less than 10M reads from the analysis (SRR039951)
sample.metadata <- sample.metadata[-15,]
# remove 2 samples that have technical duplicates (SRR039937 and SRR039938)
sample.metadata <- sample.metadata[-9,]
sample.metadata <- sample.metadata[-8,]
#####################################################################################
# print out the sample metadata table
sample.metadata
## Library_Name Run Sample_Name Tissue
## 1 mg1 SRR039378 MG1 MG
## 2 mg1 SRR039381 MG1 MG
## 3 mg1 SRR039453 MG1 MG
## 4 MG2_SL SRR039454 MG2 MG
## 5 MG2_SL SRR039455 MG2 MG
## 6 PV2_SL SRR039456 PV2 PV
## 7 PV2_SL SRR039457 PV2 PV
## 10 SA2_SL SRR039939 SA2 SG
## 11 SA2_SL SRR039940 SA2 SG
## 12 mg1 SRR039948 MG1 MG
## 13 MG2_SL SRR039949 MG2 MG
## 14 PV2_SL SRR039950 PV2 PV
## 16 SA2_SL SRR039952 SA2 SG
## 17 PV2_SL SRR042429 PV2 PV
## 18 SA1 SRR965341 SA1 SG
Next, the RNA-Seq data is downloaded from the EBI FTP site.
cat ../scripts/fastq_download.sh
## #!/bin/bash
## #
## #Script to download fastq files from European Nucleotide Archive
## #
## FILE=$1 #File containing fastq url links to EBI FTP site
##
## OUT_DIR=../data/raw_data/
##
## cat ${FILE} | xargs -n1 wget $2 -P ${OUT_DIR}
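The xargs pattern in the script expands to one download command per line of the URL file. A toy illustration, substituting echo for wget so it runs without network access (the file name and URLs are invented):

```shell
# toy URL list, one link per line
printf 'ftp://example.org/a.fastq.gz\nftp://example.org/b.fastq.gz\n' > toy.urls.txt

# xargs -n1 runs the command once per URL; echo stands in for wget here,
# so this prints one wget command per URL instead of downloading
cat toy.urls.txt | xargs -n1 echo wget -P ../data/raw_data/
```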
Some downstream tools require uncompressed FASTQ files, so the files downloaded in gzipped form are decompressed.
cat ../scripts/unzip.sh
## #!/bin/bash
## #
## #Script to decompress fastq.gz files
## #
## FASTQ_FILES=../data/raw_data/*.fastq.gz
##
## for file in ${FASTQ_FILES}; do
## gunzip ${file}
## done
After downloading the RNA-Seq data, its quality is checked with the FastQC tool, which outputs a report in HTML format.
cat ../scripts/fastqc_reports.sh
## #!/bin/bash
## #
## #Script to run FastQC reports using the FastQC tool
## #
## #load fastqc module
## module load fastqc/0.11.4
##
## FASTQ_DIR=../data/raw_data/*.fastq
##
## #create output directory if it doesn't exist.
## mkdir -p ../results/fastqc_reports
##
## REPORTS_DIR=../results/fastqc_reports/
##
## for file in ${FASTQ_DIR}; do
## fastqc -f fastq -o ${REPORTS_DIR} ${file}
## done
Following the high rate of duplicate reads reported by FastQC, further analysis is done to ascertain their cause: whether the duplicates arise from PCR artifacts (PCR duplicates) or from biological causes (highly expressed genes). This is done later in the analysis, after read mapping. Below is the sequence duplication level plot.
See more on duplication here.
Genomes are obtained from their respective databases before alignment. The steps to obtain the Glossina morsitans genome and its annotation file are indicated here as part of documenting the process; it is used in the analysis for reasons stated further in the read mapping section. The genome and annotation files are downloaded from the TriTrypDB and VectorBase databases as follows:
#Downloading T. brucei genome
wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/fasta/data/TriTrypDB-43_TbruceiTREU927_Genome.fasta \
-P ../data/tbrucei_genome/
#Downloading the GFF file
wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/gff/data/TriTrypDB-43_TbruceiTREU927.gff \
-P ../data/tbrucei_genome_annotations_GFF/
# convert the tbrucei gene annotation from GFF format to GTF (required by some downstream tools)
# uses gffread from cufflinks
mkdir -p ../data/tbrucei_genome_annotations_GTF
gffread ../data/tbrucei_genome_annotations_GFF/TriTrypDB-43_TbruceiTREU927.gff \
-T -o ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.gtf
# Download T. brucei annotated transcripts (for use in UTR motif discovery)
wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/fasta/data/TriTrypDB-43_TbruceiTREU927_AnnotatedTranscripts.fasta \
-P ../data/tbrucei_annotated_transcripts/
# Downloading Glossina genome
wget https://www.vectorbase.org/download/glossina-morsitans-yalescaffoldsgmory1fagz \
-P ../data/glossina_genome_scaffolds/
# Downloading GTF file
wget https://www.vectorbase.org/download/glossina-morsitans-yalebasefeaturesgmory19gtfgz \
-P ../data/glossina_genome_annonations_GTF/
The first step is indexing the genome using HISAT2, followed by alignment of the reads. The output is a set of SAM files.
cat ../scripts/hisat2_index.sh
## #!/bin/bash
## #
## #Script to index genome using HISAT2
## #
## GENOME_FILE=$1
##
## hisat2-build ${GENOME_FILE} bru-mor_genome_index_hisat2
cat ../scripts/hisat2_align.sh
## #!/bin/bash
## #
## #Script to align reads to the indexed genome using HISAT2
## #
## for fastq in ../data/raw_data/*.fastq; do
## fqname=${fastq%.fastq} #strip the extension; cut -f1 -d '.' would split on the dots in "../"
##
## hisat2 \
## -x bru-mor_genome_index_hisat2 \
## -U ${fastq} \
## -S ${fqname}.sam \
## -p 8 \
## --summary-file ${fqname}.txt \
## --new-summary
## done
##
## #move the output sam files to a new directory
##
## #create directory if not exists
## mkdir -p ../data/processed_data/bru-mor_sam
##
## #move the sam files
## mv ../data/raw_data/*.sam ../data/processed_data/bru-mor_sam/
Following the high rate of unaligned reads, further analysis is done to ascertain their cause. The reads are aligned to the Glossina morsitans genome to determine whether reads from the vector were also present in the sample during sequencing (dual RNA-Seq). The T. brucei and G. morsitans genome files are concatenated into a single FASTA file, which is used during the alignment of the reads. This also ensures that no cross-mapping of reads takes place.
# make a directory to store the concatenated genomes
mkdir -p ../data/brucei-morsitans
# copy the genome files to the created directory and concatenate them
cp ../data/glossina_genome_scaffolds/glossina-* ../data/brucei-morsitans/
cp ../data/tbrucei_genome/*.fasta ../data/brucei-morsitans/
cat ../data/brucei-morsitans/*.fa* > ../data/brucei-morsitans/brucei-morsitans_genomes.fasta
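As a quick sanity check on the concatenation, the number of sequence headers in the combined file should equal the sum of the inputs. A minimal sketch with toy FASTA files (the file names and sequences here are made up; the real files live under ../data/brucei-morsitans/):

```shell
# create two toy FASTA files standing in for the two genomes
printf '>chr1\nACGT\n>chr2\nGGCC\n' > genomeA.fasta
printf '>scaffold1\nTTAA\n' > genomeB.fasta

# concatenate them, as done above for the real genomes
cat genomeA.fasta genomeB.fasta > combined.fasta

# count FASTA headers in each file; the combined count should be 2 + 1 = 3
grep -c '^>' genomeA.fasta genomeB.fasta combined.fasta
```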
HISAT2 is used to re-align the reads to the T. brucei and G. morsitans genomes, beginning with indexing the concatenated genome. Below are the alignment results from the HISAT2 alignment tool.
At this point, quality control to assess the duplication rate can be performed. First, the SAM files are converted to sorted BAM files, as required by the dupRadar tool.
cat ../scripts/sam-to-bam.sh
## #!/bin/bash
## #
## #Script to convert sam files to bam files
## #
## for sam_file in ../data/processed_data/bru-mor_sam/*.sam; do
## sam_file_name=${sam_file%.sam} #strip the extension; cut -f1 -d '.' would split on the dots in "../"
## samtools view -S -b $sam_file > ${sam_file_name}.bam
## done
##
## # move the created bam files to a new directory
## mkdir -p ../data/processed_data/bru-mor_bam
##
## mv ../data/processed_data/bru-mor_sam/*.bam ../data/processed_data/bru-mor_bam/
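A caution on stripping file extensions in these loops: cut -f1 -d '.' splits on every dot, including those in a leading ../ path, in which case the first field is empty; shell parameter expansion removes only the trailing extension. A small illustration (the path is a made-up example):

```shell
path="../data/processed_data/bru-mor_sam/SRR039378.sam"

# cut splits on every '.', so the first field of "../..." is the empty string
echo "$path" | cut -f1 -d '.'

# parameter expansion removes only the trailing ".sam"
echo "${path%.sam}"
```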
The BAM files are then sorted using samtools.
cat ../scripts/sort_bam.sh
## #!/bin/bash
## #
## #Script to sort BAM file
## #
## for bam_file in ../data/processed_data/bru-mor_bam/*.bam; do
## bam_file_name=${bam_file%.bam} #strip the extension with parameter expansion
## samtools sort $bam_file -o ${bam_file_name}.sorted.bam
## done
Next, duplicates are marked in the BAM files using Picard.
cat ../scripts/mark_dupes.sh
## #!/bin/bash
## #
## #Script to mark duplicates in BAM files using picard
## #
## for bam_file in ../data/processed_data/bru-mor_bam/*.sorted.bam; do
## bam_file_name=${bam_file%.sorted.bam} #strip the full suffix with parameter expansion
##
## /opt/apps/picard-tools/1.119/bin/MarkDuplicates \
## I=$bam_file \
## O=${bam_file_name}.dupMarked.bam \
## M=${bam_file_name}.dupMetrics.txt
## done
At this point, the dupRadar tool is used to perform quality control in R. Before this quality control can be performed, we need to verify whether the reads are stranded, as this is a required parameter for dupRadar as well as for the HTSeq tool later in the analysis. This can be done using the RSeQC package (an RNA-seq quality control package); its infer_experiment.py module is used in this case. RSeQC documentation and tutorial can be found here.
First we convert the T. brucei genome annotation GTF file into the BED format required by the RSeQC package. Then we use infer_experiment.py to verify strandedness using a few samples.
# convert GTF genome annotation to BED format using a custom script from:
#https://github.com/ExpressionAnalysis/ea-utils/tree/master/clipper
../scripts/gtf2bed.pl ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.gtf > \
../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.bed
infer_experiment.py -i ../data/processed_data/bru-mor_bam/SRR039381.bam \
-r ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.bed
# output
#This is SingleEnd Data
#Fraction of reads failed to determine: 0.0008
#Fraction of reads explained by "++,--": 0.3832
#Fraction of reads explained by "+-,-+": 0.6161
The next step is to run the dupRadar quality control analysis, setting the stranded parameter to 0 (unstranded), as the reads are not strand-specific. A tutorial for the tool can be found here.
# Parameters
bam_file <- "../data/processed_data/bru-mor_bam/SRR039951.dupMarked.bam"
gtf_file <- "../data/brucei-morsitans/brucei-morsitans_annotations.gtf"
stranded <- 0
paired <- FALSE
threads <- 12
# Duplication rate analysis
dm <- analyzeDuprates(bam_file, gtf_file, stranded, paired, threads)
#Plots
png(filename = "../figures/duplication_rate/SRR039951.png")
duprateExpDensPlot(DupMat=dm)
title("SRR039951")
dev.off()
# Boxplot
duprateExpBoxplot(DupMat=dm)
Below are representative plots of samples with and without PCR duplicates. Samples with PCR duplicates are removed from the analysis before the counts data are read into R.
Sample without PCR duplicates
Sample with PCR duplicates
The HTSeq tool is used to count reads that aligned to the T. brucei genome. The T. brucei annotation file is used, so HTSeq does not count G. morsitans reads that aligned to the Glossina genome. The output is a text file for each sample containing the number of reads counted for each gene.
cat ../scripts/htseq_counts.sh
## #!/bin/bash
## #
## #Script to counts the number of reads aligned to T. brucei genome using HTSeq.
## #Resource: HTSeq documentation https://htseq.readthedocs.io/en/latest/count.html
## #
## module load htseq/0.11.2
##
## #create output directory if it doesn't exist
## mkdir -p ../results/brucei_HTSeq_count_results
##
## GFF_FILE=$1
##
## for bam_file in ../data/processed_data/bru-mor_bam/*.bam; do
## bam_file_name=$(basename $bam_file .bam) #keep only the sample name for the output file
##
## python /opt/apps/htseq/0.11.2/bin/htseq-count \
## -f bam \
## -s no \
## -t exon \
## -i Parent \
## $bam_file \
## $GFF_FILE \
## > ../results/brucei_HTSeq_count_results/${bam_file_name}.counts.txt
## done
Below is a graphical visualization of read assignment by HTSeq.
HTSeq reads assignment
MultiQC aggregates the results from the FastQC, HISAT2 and HTSeq analyses into a single HTML report for better visualization.
#change directory to results
cd ../results
#Run multiqc
multiqc .
# create a directory for the multiQC report and move the output there.
mkdir -p brucei_multiqc_report
mv multiqc* brucei_multiqc_report/
Before loading the data into R, filter out the non-protein-coding genes, which include ncRNA, snRNA, snoRNA, pseudogenic transcripts, rRNA and tRNA.
cat ../scripts/exclude_features.sh
## #!/bin/bash
## # script to exclude gene features from reads counts
##
## # create directory for the filtered counts
## mkdir -p ../results/brucei_HTSeq_count_results_mRNA
##
## for file in ../results/brucei_HTSeq_count_results/*.counts.txt; do
## counts_file=$(basename "$file" .counts.txt)
## grep -v -f ../results/excluded_features.txt ${file} > \
## ../results/brucei_HTSeq_count_results_mRNA/"${counts_file}".counts.txt
## done
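The contents of excluded_features.txt are not shown above; grep -v -f simply drops every counts line that matches any pattern in that file. A toy sketch with invented gene IDs and a single made-up pattern:

```shell
# toy counts file: two protein-coding genes and one rRNA gene (IDs are invented)
printf 'Tb927.1.100\t52\nTb927.1.200.rRNA.1\t9000\nTb927.1.300\t17\n' > toy.counts.txt

# toy exclusion patterns, one per line, as grep -f expects
printf 'rRNA\n' > toy.excluded_features.txt

# keep only the lines matching none of the patterns
grep -v -f toy.excluded_features.txt toy.counts.txt
```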
For further analysis, the sample read counts are read into R. To read the counts data in using the script below, simply type source("../scripts/htseq-combine_all.R") at the R console and hit enter. Ensure the samples excluded from the analysis (SRR039951, SRR039937 and SRR039938) are not among the input files.
cat ../scripts/htseq-combine_all.R
## #!/usr/bin/Rscript
##
## # Take 'all' htseq-count results and melt them in to one big dataframe
##
## #Adapted from: https://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.4
##
## # where are we?
## basedir <- "../results"
## setwd(basedir)
##
## cntdir <- paste(basedir, "brucei_HTSeq_count_results_mRNA", sep="/")
## pat <- ".counts.txt"
## hisat2.all <- list.files(path = cntdir,
## pattern = pat,
## all.files = TRUE,
## recursive = FALSE,
## ignore.case = FALSE,
## include.dirs = FALSE)
##
## # we choose the 'all' series
## myfiles <- hisat2.all
## DT <- list()
##
## # read each file as array element of DT and rename the last 2 cols
## # we created a list of single sample tables
## for (i in 1:length(myfiles) ) {
## infile = paste(cntdir, myfiles[i], sep = "/")
## DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE)
## cnts <- gsub("(.*).counts.txt", "\\1", myfiles[i])
## colnames(DT[[myfiles[i]]]) <- c("ID", cnts)
## }
##
## # merge all elements based on first ID columns
## data <- DT[[myfiles[1]]]
##
## # inspect
## #head(data)
##
## # we now add each other table with the ID column as key
## for (i in 2:length(myfiles)) {
## y <- DT[[myfiles[i]]]
## z <- merge(data, y, by = c("ID"))
## data <- z
## }
##
## # ID column becomes rownames
## rownames(data) <- data$ID
## data <- data[,-1]
##
## ## add total counts per sample
## data <- rbind(data, tot.counts=colSums(data))
##
## # inspect and look at the top row names!
## #head(data)
##
## #tail(data)
##
## ####################################
## # take summary rows to a new table
## # ( not starting with Tb and tmp with invert=TRUE )
##
## # transpose table for readability
## data.all.summary <- data[grep("^Tb|^tmp", rownames(data), perl=TRUE, invert=TRUE), ]
##
## # review
## #data.all.summary
##
## # transpose table
## t(data.all.summary)
##
## # write summary to file
## write.csv(data.all.summary, file = "brucei_htseq_counts_all-summary.csv")
##
## ####################################
## # take all data rows to a new table
##
## data.all <- data[grep("^Tb|^tmp", rownames(data), perl=TRUE, invert=FALSE), ]
##
## # inspect final merged table
## #head(data.all, 3)
##
## # write data to file
## write.table(data.all, file = "brucei_htseq_counts_all.txt", quote = FALSE, sep = "\t")
##
## # cleanup intermediate objects
## rm(y, z, i, DT)
##
## #return to analysis directory
## setwd("../analysis")
The quality of the samples is assessed before further analysis to detect outliers and batch effects.
# Remove extra column from sample SRR965341 added while reading data into R
data.all["Var.2"] <- NULL
# Create a DGEList object
counts <- DGEList(data.all, group = sample.metadata$Tissue)
# check the number of genes with no expression in all samples
table(rowSums(counts$counts==0)==15)
#FALSE TRUE
# 9184 792
##################################################################
# NOTE: the alternative filterByExpr() function below was used instead of this
#
# filter out Non-expressed genes and those with low expression values.
# keep genes with at least 1 read per million in n samples, where n
# is minimum number of replicates which is 3 in this case
keep <- rowSums(cpm(counts)>1) >= 3
# retain genes above the cpm value
#filtered.counts <- counts[keep, , keep.lib.sizes=FALSE]
##################################################################
# Filtering non-expressed and lowly-expressed genes.
#
# Alternative filtering function that filters better (Law et al 2016)
# retain genes above a calculated cpm value
keep.exprs <- filterByExpr(counts, group=sample.metadata$Sample_Name)
filtered.counts <- counts[keep.exprs,, keep.lib.sizes=FALSE]
# obtain logCPM unnormalized for plotting purposes.
# Here, the norm.factors value is 1 for all samples
logcpm.unnorm.counts <- cpm(filtered.counts, log = TRUE, prior.count = 2, normalized.lib.sizes = TRUE)
# Normalize for composition bias using TMM
filtered.counts <- calcNormFactors(filtered.counts, method = 'TMM')
# Convert counts per million per gene to log counts per million for further downstream analysis.
logcpm.norm.counts <- cpm(filtered.counts, log = TRUE, prior.count = 2, normalized.lib.sizes = TRUE)
Various plots are made for the samples before and after normalization.
Samples heatmap
# ensure Sample_Name is a factor (stringsAsFactors was set to FALSE earlier),
# otherwise nlevels() returns 0 and the colour assignment below fails
sample.metadata$Sample_Name <- factor(sample.metadata$Sample_Name)
sample_category <- nlevels(sample.metadata$Sample_Name)
colour.palette <- colorRampPalette(brewer.pal(sample_category, "Set2"))(sample_category)
sample.colours <- colour.palette[as.integer(sample.metadata$Sample_Name)]
# Unnormalized sample heatmap
png(filename = "../figures/unnorm_sample_heatmap.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 4, pointsize = 10)
heatmap.2(cor(logcpm.unnorm.counts), RowSideColors=sample.colours, trace='none',
main='Sample correlations')
dev.off()
# Normalized sample heatmap
png(filename = "../figures/norm_sample_heatmap.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 4, pointsize = 10)
heatmap.2(cor(logcpm.norm.counts), RowSideColors=sample.colours, trace='none',
main='Sample correlations')
dev.off()
Filtered-Unnormalized sample correlation heatmap
Normalized sample correlation heatmap
Samples density plot
# Checking further using sample density plot
# raw data
log.counts <- log2(counts$counts + 1)
png("../figures/raw_sample_density.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 6, pointsize = 10)
x <- melt(as.matrix(log.counts))
colnames(x) <- c('gene_id', 'sample', 'log')
ggplot(x, aes(x=log, color=sample)) + geom_density()
dev.off()
# filtered and unnormalized sample data
png("../figures/unnorm_sample_density.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 4, pointsize = 10)
x <- melt(as.matrix(logcpm.unnorm.counts))
colnames(x) <- c('gene_id', 'sample', 'logcpm')
ggplot(x, aes(x=logcpm, color=sample)) + geom_density()
dev.off()
# filtered and normalized sample data
png("../figures/norm_sample_density.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 4, pointsize = 10)
x <- melt(as.matrix(logcpm.norm.counts))
colnames(x) <- c('gene_id', 'sample', 'logcpm')
ggplot(x, aes(x=logcpm, color=sample)) + geom_density()
dev.off()
Raw sample density plot
Normalized sample density plot
Principal component analysis
# PCA
# raw samples PCA
pca.log.counts <- prcomp(t(log.counts)) # raw data (unnormalized and unfiltered)
png(filename = "../figures/raw_samples_PCA.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
data = sample.metadata,
colour="Sample_Name",
size=3)
dev.off()
# unnormalized samples PCA
pca.log.counts <- prcomp(t(logcpm.unnorm.counts)) #unnormalized & filtered
png(filename = "../figures/unnorm_sample_PCA.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
data = sample.metadata,
colour="Sample_Name",
size=3)
dev.off()
# normalized samples PCA
pca.log.counts <- prcomp(t(logcpm.norm.counts)) #normalized & filtered
png(filename = "../figures/norm_sample_PCA.png", res =1200, type = "cairo", units = 'in',
width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
data = sample.metadata,
colour="Sample_Name",
size=3)
dev.off()
Raw Sample PCA
Filtered-Unnormalized sample PCA
Normalized sample PCA
Boxplot
# raw sample boxplot
png(filename = "../figures/raw_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(log.counts))
colnames(y) <- c('gene_id', 'sample', 'log')
ggplot(y, aes(x=sample, y=log)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust=0.5))
dev.off()
# unnormalized sample boxplot
png(filename = "../figures/unnorm_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(logcpm.unnorm.counts))
colnames(y) <- c('gene_id', 'sample', 'logcpm')
ggplot(y, aes(x=sample, y=logcpm)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust=0.5))
dev.off()
# normalized sample boxplot
png(filename = "../figures/norm_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(logcpm.norm.counts))
colnames(y) <- c('gene_id', 'sample', 'logcpm')
ggplot(y, aes(x=sample, y=logcpm)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust=0.5))
dev.off()
Filtered-Unnormalized Sample boxplot
Normalized sample boxplot
# Apply sample grouping based on Tissue from which the sample was derived
design <- model.matrix(~0+sample.metadata$Tissue)
colnames(design) <- levels(sample.metadata$Tissue)
# Estimate dispersions for tags
filtered.counts <- estimateDisp(filtered.counts, design, robust = TRUE)
# Fit a negative binomial generalized linear model to the DGEList using the sample grouping
fit <- glmFit(filtered.counts,design)
#################################################################
# code in this section adapted from https://github.com/iscb-dc-rsg/2016-summer-workshop
# generate a list of all possible pairwise contrasts
condition_pairs <- t(combn(levels(sample.metadata$Tissue), 2))
comparisons <- list()
for (i in 1:nrow(condition_pairs)) {
comparisons[[i]] <- as.character(condition_pairs[i,])
}
# remove MG to SG comparison
comparisons[[2]] <- NULL
# vector to store differentially expressed genes
sig_genes <- c()
# iterate over the contrasts, and perform a differential expression test for
# each pair
for (conds in comparisons) {
# generate string contrast formula
contrast_formula <- paste(conds, collapse=' - ')
contrast_mat <- makeContrasts(contrasts=contrast_formula, levels=design)
contrast_lrt <- glmLRT(fit, contrast=contrast_mat)
topGenes <- topTags(contrast_lrt, n=Inf, p.value=0.05, adjust.method = "BH")
# Grab highly ranked genes
sig_genes <- union(sig_genes, rownames(topGenes$table))
}
# Filter out genes which were not differentially expressed for any contrast
de.genes <- filtered.counts[rownames(filtered.counts) %in% sig_genes,]
dim(de.genes$counts)
#4188 15
################################################################
# Obtain the counts of genes expressed for each contrast individually
# This aims to obtain the number of genes differentially expressed between
# the 3 stages of development i.e. MG -> PV, PV -> SG
# Likelihood ratio test to identify DEGs
# SG compared to PV
SG_vs_PV_lrt <- glmLRT(fit, contrast=c(0,-1,1))
# PV compared to MG
PV_vs_MG_lrt <- glmLRT(fit, contrast = c(-1,1,0))
# Genes with most significant differences (using topTags)
# SG compared to PV
topGenes_SG <- topTags(SG_vs_PV_lrt, adjust.method = "BH", p.value = 0.05, n=Inf)
dim(topGenes_SG)
#2973 5
# PV compared to MG
topGenes_PV <- topTags(PV_vs_MG_lrt, adjust.method = "BH", p.value = 0.05, n=Inf)
dim(topGenes_PV)
#2543 5
#Total across both contrasts: 2973 + 2543 = 5516; subtracting the 1328 genes
#significant in both contrasts (de.common below) gives the 4188 unique DE genes above
#######################################################################################
# DE genes at 5% FDR (using decideTestsDGE function)
#
# SG compared to PV
SG_vs_PV_de.genes <- decideTestsDGE(SG_vs_PV_lrt, adjust.method = "BH", p.value = 0.05)
# get summary
summary(SG_vs_PV_de.genes)
# -1*PV 1*SG
#Down 1375
#NotSig 4417
#Up 1598
# PV compared to MG
PV_vs_MG_de.genes <- decideTestsDGE(PV_vs_MG_lrt, adjust.method = "BH", p.value = 0.05)
# summary
summary(PV_vs_MG_de.genes)
# -1*MG 1*PV
#Down 1362
#NotSig 4847
#Up 1181
# DE genes in the PV that are common to both comparisons
de.common <- which(PV_vs_MG_de.genes!=0 & SG_vs_PV_de.genes!=0)
length(de.common)
#1328
# create a dataframe with data on PV and SG differential gene expression
PV_data <- topGenes_PV$table
SG_data <- topGenes_SG$table
# append status of regulation for each gene (either upregulated, 1 or downregulated, -1)
## NOTE: an earlier version assigned the status with %in% subsetting, which returns
## rows in the TestResults matrix's original order rather than the p-value order
## used by topTags, so statuses did not coincide with the logFC signs.
## Indexing directly by rowname keeps each gene matched to its own status.
PV_data$RegulationStatus <- PV_vs_MG_de.genes[rownames(PV_data),]
SG_data$RegulationStatus <- SG_vs_PV_de.genes[rownames(SG_data),]
# obtain upregulated and downregulated genes and write out to excel
PV_DownReg <- PV_data[order(PV_data$logFC),]
PV_UpReg <- PV_data[order(PV_data$logFC, decreasing = TRUE),]
SG_DownReg <- SG_data[order(SG_data$logFC),]
SG_UpReg <- SG_data[order(SG_data$logFC, decreasing = TRUE),]
write.xlsx(PV_DownReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
sheetName = "Down-regulated Proventriculus")
write.xlsx(PV_UpReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
sheetName = "Up-regulated Proventriculus", append = TRUE)
write.xlsx(SG_DownReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
sheetName = "Down-regulated Salivary glands", append = TRUE)
write.xlsx(SG_UpReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
sheetName = "Up-regulated Salivary glands", append = TRUE)
Plotting to visually inspect differential gene expression results.
# Differential expression analysis - plots
#
# Volcano plots
# table with negative log base 10 transformed p-values and logFC.
# Run ONE of the two assignments below at a time (the second overwrites the
# first), then the plotting code, to create a plot for each comparison.
tab <- data.frame(logFC = topGenes_PV$table$logFC,
negLogPval = -log10(topGenes_PV$table$PValue)) # PV compared to MG
tab <- data.frame(logFC = topGenes_SG$table$logFC,
negLogPval = -log10(topGenes_SG$table$PValue)) # SG compared to PV
# generating the plot
png("../figures/PV-MG_DEG_volcanoplot.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 4, pointsize = 8)
par(mar = c(5, 4, 4, 4))
plot(tab, pch = 16, cex = 0.6, xlab = expression(log[2]~fold~change),
ylab = expression(-log[10]~pvalue), main = "PV vs MG differentially expressed genes")
## Log2 fold change and p-value cutoffs
lfc <- 2
pval <- 0.05
## Select interesting genes
signGenes <- (abs(tab$logFC) > lfc & tab$negLogPval > -log10(pval))
## Identifying the selected genes
points(tab[signGenes, ], pch = 16, cex = 0.8, col = "red")
abline(h = -log10(pval), col = "green3", lty = 2)
abline(v = c(-lfc, lfc), col = "blue", lty = 2)
mtext(paste("pval =", pval), side = 4, at = -log10(pval), cex = 0.8, line = 0.5, las = 1)
mtext(c(paste("-", lfc, "fold"), paste("+", lfc, "fold")), side = 3, at = c(-lfc, lfc),
cex = 0.8, line = 0.5)
dev.off()
# create a venn diagram to show the distribution of the number of DEGs between stages
Differentially expressed genes in the proventriculus compared to midgut
Differentially expressed genes in the salivary gland compared to proventriculus
# obtain the required counts data (WGCNA input)
# WGCNA requires genes to be in columns
network.counts <- t(logcpm.norm.counts)
# determine the soft-thresholding power to use
powers <- c(c(1:10), seq(from = 12, to=20, by=2))
sft <- pickSoftThreshold(network.counts, powerVector = powers, verbose = 5)
# results
#sizeGrWindow(9, 5)
par(mfrow = c(1,2))
cex1 <- 0.9
# Scale-free topology fit index as a function of the soft-thresholding power
png(filename = "../figures/soft-thresholding_power.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 4, pointsize = 10)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",
ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# The red line corresponds to using an R^2 cut-off of 0.80
abline(h=0.80,col="red")
dev.off()
# Mean connectivity as a function of the soft-thresholding power
png(filename = "../figures/mean_connectivity.png", res =1200, type = "cairo", units = 'in',
width = 4, height = 5, pointsize = 10)
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",
ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
dev.off()
Plots to determine the soft thresholding power to use.
########################################################################
## Constructing the network
########################################################################
# construct adjacency matrix
softpower <- 12
adjacency.matrix <- adjacency(network.counts, power=softpower,
type = "signed", corFnc = "cor")
#set diagonal to 0 to remove uninformative self-correlations
diag(adjacency.matrix) <- 0
# Turn the adjacency matrix into a topological overlap matrix (TOM) to minimize
# the effects of noise and spurious associations
TOM <- TOMsimilarity(adjacency.matrix, TOMType = "signed")
dissTOM <- 1 - TOM
# remove adjacency matrix and TOM to free up memory
rm(TOM)
rm(adjacency.matrix)
gc()
# Adjacency matrix heatmap plot / network heatmap of selected genes
heatmap_indices <- sample(nrow(dissTOM), 500) # sub-sample for visualization purposes
png(filename = "../figures/adjacency_matrix_heatmap.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 5, pointsize = 10)
heatmap.2(t(dissTOM[heatmap_indices, heatmap_indices]),
col=redgreen(75),
labRow=NA, labCol=NA,
trace='none', dendrogram='row',
xlab='Gene', ylab='Gene',
main='Adjacency matrix',
density.info='none', revC=TRUE)
dev.off()
Adjacency matrix heatmap (500 genes)
################################################################
## Detecting co-expression modules in R
################################################################
# view the dendrogram based on hierarchical clustering of genes
gene.tree <- flashClust(as.dist(dissTOM), method = "average")
# plot the gene tree
png(filename = "../figures/gene_tree.png", res =1200, type = "cairo", units = 'in',
width = 7, height = 8, pointsize = 10)
#sizeGrWindow(12,9) #open graphical window
plot(gene.tree, xlab="", sub="", main = "Gene clustering based on TOM dissimilarity",
labels = FALSE, hang = 0.04)
dev.off()
# identify the modules
module.labels <- cutreeDynamicTree(gene.tree, deepSplit = FALSE,
minModuleSize = 30)
#view
table(module.labels)
# convert labels to colours
module.colours <- labels2colors(module.labels)
# view
table(module.colours)
# a list of 35 colours (34 modules plus grey, which holds unassigned genes)
black blue brown cyan darkgreen
333 611 495 219 118
darkgrey darkmagenta darkolivegreen darkorange darkred
109 33 44 103 130
darkturquoise green greenyellow grey grey60
111 341 272 19 155
lightcyan lightgreen lightyellow magenta midnightblue
193 155 143 290 210
orange paleturquoise pink purple red
106 73 291 282 340
royalblue saddlebrown salmon skyblue steelblue
136 85 234 99 80
tan turquoise violet white yellow
265 693 68 101 453
# visualize the gene tree and TOM matrix together using TOM plot
# if necessary, raise dissTOM to a power to make moderately strong connections more visible in the heatmap
png(filename = "../figures/gene_tree_and_TOM.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 6, pointsize = 10)
TOMplot(dissTOM, gene.tree, as.character(module.colours))
dev.off()
# plot gene dendrogram
png(filename = "../figures/gene_tree_and_colours.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 6, pointsize = 10)
#sizeGrWindow(8,6) #open graphical window
plotDendroAndColors(gene.tree, module.colours, "Dynamic Tree Cut", dendroLabels = FALSE,
hang = 0.03, addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colours")
dev.off()
# get hub genes
module.hub.genes <- chooseTopHubInEachModule(network.counts, module.colours,
power = 12, type = "signed") # use the soft-thresholding power chosen above
# A list of module hub genes
black blue brown
"Tb10.v4.0128:mRNA" "Tb927.1.3130:mRNA" "Tb927.3.1630:mRNA"
cyan darkgreen darkgrey
"Tb927.11.1450:mRNA" "Tb927.11.3260:mRNA" "Tb927.9.14480:mRNA"
darkmagenta darkolivegreen darkorange
"Tb927.3.3730:mRNA" "Tb927.11.1570:mRNA" "Tb927.6.2960:mRNA"
darkred darkturquoise green
"Tb927.3.4000:mRNA" "Tb927.7.920:mRNA" "Tb927.2.5810:mRNA"
greenyellow grey60 lightcyan
"Tb927.6.3880:mRNA" "Tb927.10.14760:mRNA" "Tb927.11.900:mRNA"
lightgreen lightyellow magenta
"Tb927.8.5550:mRNA" "Tb927.10.10770:mRNA" "Tb927.10.12780:mRNA"
midnightblue orange paleturquoise
"Tb927.9.6290:mRNA" "Tb927.9.10100:mRNA" "Tb927.9.5690:mRNA"
pink purple red
"Tb927.8.1650:mRNA" "Tb927.9.5960:mRNA" "Tb927.5.3130:mRNA"
royalblue saddlebrown salmon
"Tb927.6.3150:mRNA" "Tb927.9.2600:mRNA" "Tb927.11.5450:mRNA"
skyblue steelblue tan
"Tb927.6.2790:mRNA" "Tb927.7.2500:mRNA" "Tb927.3.1380:mRNA"
turquoise violet white
"Tb927.9.5240:mRNA" "Tb927.3.3270:mRNA" "Tb927.3.1370:mRNA"
yellow
"Tb927.7.7200:mRNA"
Gene tree and module colours
The section below is included as an additional check; it is not strictly necessary for the main analysis.
# --------------------------------------------------------------------------------------------
# merge modules with very similar expression profiles, since their genes are highly co-expressed
# get the module eigengenes
module.eigengenes <- moduleEigengenes(network.counts, colors = module.colours)$eigengenes
# calculate dissimilarity of module eigengenes using correlations
module.eigengenes.diss <- 1 - cor(module.eigengenes)
# cluster module eigengenes
module.eigengenes.tree <- flashClust(as.dist(module.eigengenes.diss), method = "average")
# choose height at which to cut the tree for merge i.e. the threshold
module.eigengenes.thresh <- 0.25
# create plots for the results
png(filename = "../figures/module_eigengenes_cluster.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 6, pointsize = 10)
#sizeGrWindow(7, 6)
plot(module.eigengenes.tree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h=module.eigengenes.thresh, col="red")
dev.off()
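For intuition (a sketch, not WGCNA's exact implementation, which also fixes the eigengene's sign and scaling): a module eigengene is essentially the first principal component of the module's expression matrix, and a cut height of 0.25 on 1 - cor merges modules whose eigengenes correlate above 0.75.

```r
# Sketch: the eigengene summarizes a module as one profile across samples.
set.seed(1)
module.expr <- matrix(rnorm(50), nrow = 10)    # toy module: 10 samples x 5 genes
eigengene <- prcomp(scale(module.expr))$x[, 1] # one summary value per sample
stopifnot(length(eigengene) == nrow(module.expr))
# A cut height of 0.25 on (1 - cor) merges modules whose eigengene
# correlation exceeds 1 - 0.25 = 0.75
stopifnot((1 - 0.80) < 0.25,    # cor 0.80 -> merged
          !((1 - 0.60) < 0.25)) # cor 0.60 -> kept separate
```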
# merge the modules
module.eigengenes.merge <- mergeCloseModules(network.counts, module.colours,
cutHeight = module.eigengenes.thresh)
# merged module colours
merged.module.colours <- module.eigengenes.merge$colors
# view
table(merged.module.colours)
# a list of 14 colours after merging (including grey, the unassigned genes)
black cyan darkgreen darkgrey darkolivegreen
828 583 229 109 780
green greenyellow grey grey60 paleturquoise
794 802 19 370 73
pink royalblue steelblue white
1771 501 80 451
# eigengenes of new merged modules
merged.module.eigengenes <- module.eigengenes.merge$newMEs
# plot the dendrogram with original and merged colours underneath
#sizeGrWindow(12, 9)
png(filename = "../figures/merged-original_colours-original_dendro.png", res =1200, type = "cairo",
units = 'in', width = 5, height = 6, pointsize = 10)
plotDendroAndColors(gene.tree, cbind(module.colours, merged.module.colours),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()
# plot heatmap of eigengenes (orginal before merge)
png(filename = "../figures/eigengenes_heatmap.png", res =1200, type = "cairo", units = 'in',
width = 5, height = 6, pointsize = 10)
plotEigengeneNetworks(module.eigengenes, "Eigengenes heatmap", marHeatmap = c(3,4,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
Dendrogram with original and merged colours
#-----------------------------------------------------------------------------------------
# rename some variables based on the module eigengene analysis for later use
#
# module colours
#module.colours <- merged.module.colours
# construct numerical labels corresponding to the colours
colorOrder <- c("grey", standardColors(50))
module.labels <- match(module.colours, colorOrder)-1
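A small sketch of this colour-to-label conversion, using a short fixed palette in place of standardColors() (whose first entries are shown here; label 0 is reserved for grey, the unassigned genes):

```r
# Sketch: colours are mapped to integer labels by their position in the
# palette, minus one so that grey (unassigned) becomes label 0.
colorOrder.sketch <- c("grey", "turquoise", "blue", "brown")
cols <- c("blue", "grey", "turquoise")
labs <- match(cols, colorOrder.sketch) - 1
stopifnot(all(labs == c(2, 0, 1)))
```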
# module eigengenes
#module.eigengenes <- merged.module.eigengenes
# get hub genes
merged.module.hub.genes <- chooseTopHubInEachModule(network.counts,
merged.module.colours,
power = 12,
type = "signed")#use softhresholding power
# a list of merged module hub genes
black cyan darkgreen
"Tb927.11.1690.2:mRNA" "Tb927.4.1390:mRNA" "Tb927.10.12710:mRNA"
darkgrey darkolivegreen green
"Tb927.9.14480:mRNA" "Tb927.4.4730:mRNA" "Tb927.11.760:mRNA"
greenyellow grey60 paleturquoise
"Tb927.11.13610:mRNA" "Tb927.10.14760:mRNA" "Tb927.9.5690:mRNA"
pink royalblue steelblue
"Tb927.5.630:mRNA" "Tb927.10.12750:mRNA" "Tb927.7.2500:mRNA"
white
"Tb927.1.580:mRNA"
##############################################################################
## Network export to cytoscape
##############################################################################
# select modules of interest
interesting.modules <- c('black', 'cyan', 'grey','brown',
'midnightblue','blue','darkgreen','tan', 'darkgrey','darkorange',
'darkred','darkturquoise','green','greenyellow','grey60','lightcyan',
'lightgreen','lightyellow','magenta','orange','pink','purple','red',
'royalblue','salmon','skyblue','turquoise','white','yellow',
'saddlebrown', 'steelblue','darkmagenta','darkolivegreen','paleturquoise',
'violet') # these are all module colours, thus the whole network
# obtain gene ids
gene.ids <- rownames(logcpm.norm.counts)
# select module genes
inModules <- is.finite(match(module.colours, interesting.modules))
modGenes <- gene.ids[inModules]
# select the corresponding dissTOM based on module genes
modTOM <- dissTOM[inModules, inModules]
dimnames(modTOM) <- list(modGenes, modGenes)
# Export the network into edge and node list files Cytoscape can read
cyt <- exportNetworkToCytoscape(modTOM,
edgeFile = "../results/CytoscapeInput-edges.txt",
nodeFile = "../results/CytoscapeInput-nodes.txt",
weighted = TRUE,
threshold = 0,
nodeNames = modGenes,
nodeAttr = module.colours[inModules]);
# remove the cytoscape network
rm(cyt)
# Also, export the network as graphml format
# use export_network_to_graphml function
source("../scripts/network_export_graphml.R")
# the whole network
entire.network <- export_network_to_graphml(dissTOM,
filename = "../results/entire_network.graphml",
threshold = 0,
nodeAttr = gene.ids,
max_edge_ratio = 3)
# network modules
# create a data frame with node attributes (a data frame, not cbind, so that
# the colourRGB column can be added with $)
node.attributes <- data.frame(modGenes, module = module.colours[inModules])
# add hex (RGB) versions of the module colours; col2hex is from gplots
node.attributes$colourRGB <- col2hex(node.attributes$module)
modules.network <- export_network_to_graphml(modTOM,
filename = "../results/modules_network.graphml",
threshold = 0.02,
nodeAttrDataFrame = node.attributes)
Subnetwork of 1399 genes
Module genes and their module labels are required as input by FIRE (Finding Informative Regulatory Elements). Hence we write out text files with this input.
# write out cluster/module genes and their corresponding module labels for use by
# FIRE (Finding Informative Regulatory Elements)
fire.clusters <- data.frame(modGenes, module.labels[inModules])
# sort by module label; FIRE expects the cluster label column to start from 0.
fire.clusters <- fire.clusters[order(fire.clusters[[2]]), ]
write.table(as.matrix(fire.clusters), file = "../results/tbrucei_FIRE_expression_clusters.txt",
quote = FALSE, row.names = FALSE, sep = "\t")
# Also, write out module genes in a text file.
write.table(data.frame(modGenes), file = "../results/tbrucei_module_genes.txt", row.names = FALSE,
quote = FALSE, col.names = FALSE)
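An optional sanity check on the table written above (a sketch; check_fire_clusters is an illustrative helper, not part of FIRE): FIRE expects two tab-separated columns, gene id and integer cluster label, with labels starting at 0 and sorted in ascending order.

```r
# Sketch: verify the shape FIRE expects from the cluster table.
check_fire_clusters <- function(df) {
  ncol(df) == 2 && min(df[[2]]) == 0 && !is.unsorted(df[[2]])
}
# e.g. check_fire_clusters(fire.clusters) should return TRUE
toy <- data.frame(gene = c("g1", "g2", "g3"), cluster = c(0, 0, 1))
stopifnot(check_fire_clusters(toy))
```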
Prepare the input sequences for FIRE by extracting the module genes' sequences from T. brucei's annotated transcripts, using seqtk.
seqtk subseq \
../data/tbrucei_annotated_transcripts/TriTrypDB-43_TbruceiTREU927_AnnotatedTranscripts.fasta \
../results/tbrucei_module_genes.txt > \
../results/tbrucei_module_genes_sequences.fasta
# keep only the transcript id in each FASTA header: drop everything after the
# "|" delimiter, then drop any trailing text after a space
cut -f1 -d "|" ../results/tbrucei_module_genes_sequences.fasta | \
cut -f1 -d " " > \
../results/tbrucei_module_genes_sequences_without_delimiter.fasta
# FIRE commands - FIRE should be run in the directory where all scripts were installed. Therefore,
# the input files should be copied to FIRE's directory (FIRE-x.x/)
#
#check whether input files are OK
perl TOOLS/FIRE_analyse_input_files.pl \
-fastafile tbrucei_module_genes_sequences_without_delimiter.fasta \
-expfile tbrucei_FIRE_expression_clusters.txt
# remove sequences >= 10,000 base pairs as they make FIRE crash
# NB: the awk one-liner below assumes two-line FASTA records (header + one sequence line)
# code - https://www.biostars.org/p/62678/
cat ../results/tbrucei_module_genes_sequences_without_delimiter.fasta | \
awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=9999) {printf("%s\n%s\n",L[0],L[1]);}}' > \
../results/tbrucei_module_genes_sequences_without_delimiter_filtered.fasta
# recheck input files
perl TOOLS/FIRE_analyse_input_files.pl \
-fastafile tbrucei_module_genes_sequences_without_delimiter_filtered.fasta \
-expfile tbrucei_FIRE_expression_clusters.txt
# output
#Checking the expression file
#Expression file is OK.
#
#Checking the fasta file
#Fasta file is OK (min sequence length = 98, max length = 9917).
#
#Found fasta sequence for 7272 / 7390 identifiers in expression file.
# Analyze sequences for motifs
perl fire.pl --expfiles=tbrucei_FIRE_expression_clusters.txt --exptype=discrete \
--fastafile_dna=tbrucei_module_genes_sequences_without_delimiter_filtered.fasta --nodups=1
# create HTML output of the results
perl MORESCRIPTS/makeresultindex.pl tbrucei_FIRE_expression_clusters.txt "T. brucei cluster motifs"